The MIT License (MIT)
Copyright (c) 2016 Erika Fille Legara
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Real-world networks reveal clustering behaviour, exhibited in the formation of communities/clusters/partitions (the terms are used interchangeably in this notebook) in the graph structure. When studying networks, their structure, and their function, it is crucial to identify these groupings. In this recipe, we explore these different structures (including cliques and connected components) with a strong focus on clustering and community detection. In particular, we explore two general methods and three submethods:
Note: This notebook is a supplementary material to Michael Lees and Debraj Roy's A Short Practical Introduction to NetworkX for the 2016 NTU Winter School on Complex Systems and also to the presentation on "Community Structures" by the author.
Import all necessary packages for this notebook to run smoothly. If you are encountering errors in importing some of the packages, please check the 0. Introduction and Setup notebook by Michael and Debraj.
In [1]:
try:
## For Network Analysis and Visualization
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from collections import defaultdict
import operator
## For Hierarchical Clustering
from scipy.cluster import hierarchy
from scipy.spatial import distance
## For Community Detection (Louvain Method)
import community
except:
import traceback
traceback.print_exc()
raise ImportError('Something failed, see above.')
Here, we use Zachary's karate club network $K$ to illustrate some concepts on network structures. This network was first presented in [1].
[1] W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977).
In [3]:
K = nx.karate_club_graph()
In [4]:
print "Number of nodes: ", K.order()
print "Number of edges: ", K.size()
In [5]:
pos = nx.fruchterman_reingold_layout(K);
In [6]:
plt.figure(figsize=(8,8));
plt.axis("off");
nx.draw_networkx_nodes(K, pos, node_size=300, node_color="black");
nx.draw_networkx_edges(K, pos, alpha=0.500);
nx.draw_networkx_labels(K, pos, font_color="white");
plt.show();
In [7]:
mrhi = [0,1,2,3,4,5,6,7,8,10,11,12,13,16,17,19,21]
johna = [9,14,15,18,20,22,23,24,25,26,27,28,29,30,31,32,33]
for mem in K.nodes():
if mem in mrhi:
K.node[mem]["group"] = "mrhi"
else:
K.node[mem]["group"] = "johna"
In [8]:
plt.figure(figsize=(8,8));
plt.axis("off");
nx.draw_networkx_nodes(K,pos,
nodelist=mrhi,
node_color='r',
node_size=300,
alpha=0.8);
nx.draw_networkx_nodes(K,pos,
nodelist=johna,
node_color='b',
node_size=300,
alpha=0.8);
nx.draw_networkx_edges(K, pos, alpha=0.5);
nx.draw_networkx_labels(K, pos, font_color="white");
In [9]:
print list(nx.find_cliques(K))
The betweenness of an edge $e$ is the fraction of all shortest paths that pass through it:
$$c_B(e)=\sum _{i,j \in V}\frac{\sigma (i,j \mid e)}{\sigma (i,j)},$$where $\sigma(i,j)$ is the total number of shortest paths between nodes $i$ and $j$, and $\sigma(i,j \mid e)$ is the number of those paths that pass through edge $e$. In networkx
, the function is edge_betweenness_centrality(G)
. Below, we illustrate the function using Zachary's karate club $K$.
In [10]:
# order the dictionary by value, i.e., by the edge betweenness of each edge
ebet = nx.edge_betweenness_centrality(K)
sorted_ebet = sorted(ebet.items(), key=operator.itemgetter(1), reverse=True)
sorted_ebet[0:5]
Out[10]:
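As a quick sanity check (this toy example is ours, not part of the karate-club analysis), consider a graph small enough to verify by hand: a four-node path. The middle edge lies on four of the six shortest paths, and networkx normalizes by the $n(n-1)/2 = 6$ node pairs:

```python
import networkx as nx

P = nx.path_graph(4)  # 0 - 1 - 2 - 3
ebc = nx.edge_betweenness_centrality(P)

# edge (1, 2) carries the shortest paths for the pairs
# (0,2), (0,3), (1,2), (1,3): 4 of the 6 pairs in total
print(ebc[(1, 2)])  # 4/6 ~ 0.667
print(ebc[(0, 1)])  # 3/6 = 0.5
```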
Iteratively remove edges with the highest edge betweenness. For purposes of illustration, we first make a copy of the karate club network $K$. We then perform $i=10$ iterations of edge removal.
In [11]:
K2 = K.copy()
def remove_top_ebet(K2):
ebet = nx.edge_betweenness_centrality(K2)
sorted_ebet = sorted(ebet.items(), key=operator.itemgetter(1), reverse=True)
edge_to_remove = sorted_ebet[0]
K2.remove_edge(*edge_to_remove[0])
return K2
for i in range(10):
K2=remove_top_ebet(K2)
In [12]:
nx.draw(K2, pos)
nx.draw_networkx_labels(K2, pos)
plt.show()
In [13]:
for i in range(5):
K2=remove_top_ebet(K2)
nx.draw(K2, pos)
nx.draw_networkx_labels(K2, pos)
plt.show()
The algorithm for finding "communities" based on iterative edge-betweenness removal was presented by Girvan and Newman in their 2002 PNAS paper. Because it works by breaking the network apart, the algorithm is classified as a divisive procedure (as opposed to agglomerative, which is discussed in the next section). The snippet below lists the resulting connected components.
In [14]:
print "Number of components: ", len(list(nx.connected_components(K2))),"\n"
print list(nx.connected_components(K2))
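Rather than fixing the number of removals in advance, one Girvan-Newman step can be phrased as "strip top-betweenness edges until the graph splits". A minimal sketch (the helper name `split_once` is ours, not part of the notebook):

```python
import networkx as nx

def split_once(G):
    """One Girvan-Newman step: repeatedly remove the edge with the
    highest betweenness until the number of components increases."""
    H = G.copy()
    n0 = nx.number_connected_components(H)
    while nx.number_connected_components(H) == n0:
        ebc = nx.edge_betweenness_centrality(H)
        H.remove_edge(*max(ebc, key=ebc.get))
    return H

K = nx.karate_club_graph()
H = split_once(K)
print(nx.number_connected_components(H))  # 2
```

Repeating `split_once` on each resulting component yields the full Girvan-Newman dendrogram.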
In this method, we start with a distance matrix $D$, which provides the distance between any two nodes $i$ and $j$ in network $G$. Using $D$, we then perform hierarchical clustering recursively, grouping similar nodes at each step. There are various agglomerative procedures for the similarity grouping, such as single, complete, and average linkage, to name a few. In the example below, we implement the average/UPGMA linkage, which is the one used in the Ravasz algorithm published in Science in 2002.
In the following cells, we perform hierarchical clustering and visualize the order in which the nodes are grouped together in communities.
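Before applying this to the karate club, the scipy machinery can be exercised in isolation on a toy example (the points and threshold here are ours, chosen so the expected grouping is obvious): average linkage merges the two tight pairs before bridging the large gap between them.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

# four 1-D points: two tight pairs separated by a large gap
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
D = distance.squareform(distance.cdist(pts, pts))  # condensed distances
Z = hierarchy.average(D)                           # average/UPGMA linkage
labels = hierarchy.fcluster(Z, t=2, criterion='distance')
print(labels)  # e.g. [1 1 2 2]: {0, 1} and {10, 11} are grouped together
```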
In [15]:
__author__ = """\n""".join(['Erika Fille Legara <legareft@ihpc.a-star.edu.sg>',
'Maksim Tsvetovat <maksim@tsvetovat.org>',
'Drew Conway <drew.conway@nyu.edu>',
'Aric Hagberg <hagberg@lanl.gov>'])
'''
The original code was written by Drew Conway and Aric Hagberg and was later
modified by Maksim Tsvetovat for his book titled “Social Network Analysis for Startups”.
EF Legara slightly modified Tsvetovat's version to address some issues
on the ordering of keys within the path_length dictionary. Moreover,
in this code, a different agglomerative procedure is implemented.
'''
def create_hc(G, t):
## Set-up the distance matrix D
labels=G.nodes() # keep node labels
path_length=nx.all_pairs_shortest_path_length(G)
distances=np.zeros((len(G),len(G)))
for u,p in path_length.items():
for v,d in p.items():
distances[G.nodes().index(u)][G.nodes().index(v)] = d
distances[G.nodes().index(v)][G.nodes().index(u)] = d
if u==v: distances[G.nodes().index(u)][G.nodes().index(u)]=0
# Create hierarchical cluster (HC)
# There are various other routines for agglomerative clustering,
# but here we create the HC using the average/UPGMA linkage
Y = distance.squareform(distances) ## condensed (upper-triangular) form of the distance matrix
Z = hierarchy.average(Y)
# This partition selection (t) is arbitrary, for illustrative purposes
membership=list(hierarchy.fcluster(Z,t=t))
# Create collection of lists for blockmodel
partition = defaultdict(list)
for n,p in zip(list(range(len(G))),membership):
partition[p].append(labels[n])
return Z, membership, partition
In [16]:
Z, membership, partition = create_hc(K, t=1.15)
partition.items()
Out[16]:
In [17]:
partition = {}
i = 0
for i in range(len(membership)):
partition[i]=membership[i]
In [18]:
plt.figure(figsize=(10,10))
plt.axis('off')
nx.draw_networkx_nodes(K, pos, cmap=plt.cm.RdYlBu, node_color=partition.values())
nx.draw_networkx_edges(K, pos, alpha=0.5)
nx.draw_networkx_labels(K, pos)
plt.show()
In [19]:
plt.figure(figsize=(10,8))
hierarchy.dendrogram(Z)
plt.show()
The community detection algorithm that we implement in this notebook is a modularity-based algorithm. The modularity $M_c$ quantifies how good a "community" or partition is, and is given by
\begin{equation} M_c = \sum _{c=1} ^{n_c} \left[ \frac{L_c}{L} - \left (\frac{k_c}{2L} \right)^2\right] \end{equation}where $n_c$ is the number of communities, $L_c$ is the number of links within community $c$, $k_c$ is the total degree of the nodes belonging to community $c$, and $L$ is the total number of links in the network. The illustration below, taken from the slide deck, shows how modularity is calculated.
The higher the network modularity, the better the partitioning. If the whole network forms a single community, $M=0$.
In this recipe, we implement a modularity-maximization algorithm called the Louvain algorithm. We go back to Zachary's karate club and find the best partitions.
In [21]:
partition = community.best_partition(K)
Now, let's redraw the network; this time, highlighting the different communities by node color.
In [22]:
plt.figure(figsize=(10,10))
plt.axis('off')
nx.draw_networkx_nodes(K, pos, cmap=plt.cm.RdYlBu, node_color=partition.values())
nx.draw_networkx_edges(K, pos, alpha=0.5)
nx.draw_networkx_labels(K, pos)
plt.show()
The algorithm produces four (4) partitions. Let's compare this partitioning with the actual faction split.
In [23]:
plt.figure(figsize=(8,8));
plt.axis("off");
nx.draw_networkx_nodes(K,pos, nodelist=mrhi, node_color='r',
node_size=300, alpha=0.8);
nx.draw_networkx_nodes(K,pos, nodelist=johna, node_color='b',
node_size=300, alpha=0.8);
nx.draw_networkx_edges(K, pos, alpha=0.5);
nx.draw_networkx_labels(K, pos, font_color="white");
There are two types of Marvel Universe networks available on the web: the Comic Network and the Hero Network. The Comic Network connects characters to the comic issues they appeared in; the Hero Network, on the other hand, represents the (weighted) connections between characters that appear in the same comic issue. In this notebook, we look at the Hero Network.
In [24]:
import unicodecsv as csv
M = nx.Graph(name="Marvel Universe")
reader = csv.reader(open("./Module 6 Datasets/hero-network.csv", 'rU'))
for row in reader:
M.add_edge(*row)
To export this network for use in Gephi, use
nx.write_gexf(M, "marvel_universe.gexf")
In [25]:
M.size() #number of edges
Out[25]:
In [26]:
M.order() #number of nodes
Out[26]:
In [27]:
M_cliques=list(nx.find_cliques(M))
In [28]:
print M_cliques[0:2]
In [29]:
partition = community.best_partition(M)
In [30]:
len(set(partition.values()))
Out[30]:
The code below may take a while to run since we are looking at a relatively large network, so we're commenting it out.
plt.figure(figsize=(10,10))
plt.axis('off')
pos = nx.spring_layout(M)
nx.draw_networkx_nodes(M, pos,cmap=plt.cm.RdYlBu, node_color=partition.values())
nx.draw_networkx_edges(M, pos, alpha=0.4)
plt.savefig("Marvel Universe.png")
But this is how it looks using Gephi.
What are these communities? Which community is the largest? Let's take a peek inside them.
In [31]:
comm = defaultdict(list)
for k, v in partition.items():
comm[v].append(k)
In [32]:
print comm.items()[11]
In this example, we look at the word co-occurrence dataset used in the paper titled "News Framing of Population and Family Planning Issues via Syntactic Network Analysis" by E.F. Legara et al. Using a community detection algorithm, we looked for word patterns (how the words were 'arranged' or connected in the sentences of each news article) to evaluate how the media framed the population issue in the Philippines through the different labels used to refer to it in public discourse.
In [33]:
F = nx.read_gml("./Module 6 Datasets/word-net.gml", label="label")
In [34]:
# nx.write_gexf(F, "framing.gexf")
In [35]:
partition = community.best_partition(F, weight="weight")
We're commenting this out because the network is so dense that the plot doesn't say much.
plt.figure(figsize=(10,10))
plt.axis('off')
pos = nx.spring_layout(F, iterations=150)
nx.draw_networkx_nodes(F, pos,cmap=plt.cm.RdYlBu, node_color=partition.values(), node_size=10, label=True)
nx.draw_networkx_edges(F, pos, alpha=0.4)
plt.savefig("Frames.png")
Instead, let's view the communities by printing out the words.
In [36]:
coms = defaultdict(list)
for k in partition.keys():
coms[partition[k]].append(k)
In [37]:
print coms[0], "\n", coms[1], "\n", coms[2], "\n", coms[3], "\n", coms[4]
In [38]:
# P = nx.read_gml("./Module 6 Datasets/polblogs.gml")
P = nx.read_gml("./Module 6 Datasets/polblogs copy.gml")
In [39]:
P.nodes(data=True)[0:2]
Out[39]:
In [40]:
P.order(), P.size()
Out[40]:
In [41]:
pos = nx.spring_layout(P)
nx.draw(P, pos, node_size=30)
plt.show()
In [42]:
PR = P.to_undirected()
PR = nx.Graph(PR)
In [43]:
mypalette = ["blue","red","green", "yellow", "orange", "violet", "grey", "grey","grey"]
In [44]:
pos = nx.spring_layout(PR)
#colors = [mypalette[PR.node[i]['value']] for i in range(1,len(PR.nodes()))]
colors = [mypalette[PR.node[i]['value']] for i in PR.nodes()]
nx.draw(PR, pos, node_color=colors, node_size=10)
plt.show()
In [45]:
Gcc=sorted(nx.connected_component_subgraphs(PR), key = len, reverse=True)
GC=Gcc[0]
In [46]:
partition = community.best_partition(GC)
In [47]:
print "Number of Communities: ", len(set(partition.values()))
In [48]:
for k, v in partition.items():
GC.node[k]["louvain-val"] = v
colors = [mypalette[GC.node[node]["louvain-val"]] for node in GC.nodes()]
In [49]:
plt.figure(figsize=(10,10))
plt.axis('off')
pos = nx.spring_layout(GC, scale=3)
nx.draw_networkx_nodes(GC, pos, node_color=colors, node_size=20, label=True)
nx.draw_networkx_edges(GC, pos, alpha=0.4)
plt.savefig("polblogs.png")